Abstract

Iterative procedures for parameter estimation based on stochastic gradient descent (SGD) allow
the estimation to scale to massive data sets. However, in both theory and practice, they suffer
from numerical instability. Moreover, they are statistically inefficient as estimators of the
true parameter value. To address these two issues, we propose a new iterative procedure termed
averaged implicit SGD (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the
iterates, which achieves the optimal Cramér-Rao bound under strong convexity, i.e., it is an
optimal unbiased estimator of the true parameter value. For numerical stability, AI-SGD employs
an implicit update at each iteration, which is related to proximal operators in optimization.
In practice, AI-SGD achieves competitive performance with other state-of-the-art procedures.
Furthermore, it is more stable than averaging procedures that do not employ proximal updates,
and is simple to implement as it requires fewer tunable hyperparameters than procedures that do
employ proximal updates.

1 Introduction 

The majority of problems in statistical estimation can be cast as finding the parameter value
$\theta^\star \in \Theta$ such that

$$\theta^\star = \arg\min_{\theta \in \Theta} \mathbb{E}\left(L(\theta, \xi)\right), \tag{1}$$

where the expectation is with respect to the random variable $\xi \in \Xi$ that represents the
data, $\Theta \subseteq \mathbb{R}^p$ is the parameter space, and
$L : \Theta \times \Xi \to \mathbb{R}$ is a loss function. A popular procedure for solving
Eq. (1) is stochastic gradient descent (SGD) (Zhang, 2004; Bottou, 2004), where a sequence
$\theta_n$ approximates $\theta^\star$, and is updated iteratively, one data point at a time,
through the iteration

$$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_{n-1}, \xi_n), \tag{2}$$

where $\{\xi_1, \xi_2, \ldots\}$ is a stream of i.i.d. realizations of $\xi$, and
$\{\gamma_n\}$ is a non-increasing sequence of positive real numbers, known as the learning
rate. The $n$th iterate $\theta_n$ in SGD (2) can be viewed as an estimator of $\theta^\star$.
To evaluate such iterative estimators it is typical to consider three properties: convergence
rate and numerical stability, by studying the mean-squared errors
$\mathbb{E}(\|\theta_n - \theta^\star\|^2)$; and statistical efficiency, by studying the limit
$n \mathrm{Var}(\theta_n)$, as $n \to \infty$.
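As a minimal sketch of iteration (2), consider a scalar least-squares problem. The loss $\frac{1}{2}(y - x\theta)^2$, the names `grad_sq_loss` and `sgd`, the simulated data, and the rate $\gamma_n = \gamma_1 n^{-1}$ are assumptions chosen for illustration only; they are not taken from the paper.

```python
import random

def grad_sq_loss(theta, x, y):
    """Gradient of the illustrative loss 0.5 * (y - x * theta)**2 with respect to theta."""
    return -(y - x * theta) * x

def sgd(stream, theta0, gamma1, power=1.0):
    """Classic SGD, Eq. (2): theta_n = theta_{n-1} - gamma_n * grad L(theta_{n-1}, xi_n)."""
    theta = theta0
    for n, (x, y) in enumerate(stream, start=1):
        gamma_n = gamma1 * n ** (-power)  # non-increasing learning rate
        theta = theta - gamma_n * grad_sq_loss(theta, x, y)
    return theta

# Simulated stream with true parameter theta_star (hypothetical data).
rng = random.Random(0)
theta_star = 2.0
data = []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, theta_star * x + rng.gauss(0.0, 1.0)))

theta_hat = sgd(data, theta0=0.0, gamma1=1.0)
print(theta_hat)
```

Each data point is visited once, so the cost per iteration is independent of the size of the data set.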

While computationally efficient, the SGD procedure (2) suffers from numerical instability and
statistical inefficiency. Regarding stability, SGD is sensitive to specification of the
learning rate $\gamma_n$, since the mean-squared errors can diverge arbitrarily when $\gamma_n$
is misspecified with respect to problem parameters, e.g., the convexity and Lipschitz
parameters of the loss function (Benveniste et al., 1990; Moulines and Bach, 2011). Regarding
statistical efficiency, SGD loses statistical information. In fact, the amount of information
loss depends on the misspecification of $\gamma_n$ with respect to the spectral gap of the
matrix $\mathbb{E}(\nabla^2 L(\theta^\star, \xi))$ (Toulis et al., 2014), also known as the
Fisher information matrix. Several solutions have been proposed to resolve these two issues,
e.g., using projections and gradient clipping. However, they are usually heuristic and hard to
generalize.

In this paper, we aim for the ideal combination of computational efficiency, numerical
stability, and statistical efficiency using the following procedure (AI-SGD):

$$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_n, \xi_n), \tag{3}$$

$$\bar\theta_n = \frac{1}{n} \sum_{i=1}^{n} \theta_i. \tag{4}$$

Our proposed procedure, termed averaged implicit SGD (AI-SGD), comprises two inner procedures.
The first procedure employs the updates given in Eq. (3), which are implicit because the
iterate $\theta_n$ appears on both sides of the equation. Procedure (3), also known as implicit
SGD (Toulis et al., 2014), aims to stabilize the updates of the classic SGD procedure (2). In
fact, implicit SGD can be motivated as the limit of a sequence of improved classic SGD
procedures. To see this, first fix the sample history
$\mathcal{F}_{n-1} = \{\theta_0^s, \xi_1, \xi_2, \ldots, \xi_{n-1}\}$, where we use the
superscript "s" in the classic SGD procedure in order to distinguish it from implicit SGD.
Then, $\theta_n^s = \theta_{n-1}^s - \gamma_n \nabla L(\theta_{n-1}^s, \xi_n) \triangleq \theta_n^{(1)}$.
If we "trust" $\theta_n^{(1)}$ to be a better estimate of $\theta^\star$ than
$\theta_{n-1}^s$, then we can use $\theta_n^{(1)}$ instead of $\theta_{n-1}^s$ in computing the
loss function at data point $\xi_n$. This leads to a revised update
$\theta_n^s = \theta_{n-1}^s - \gamma_n \nabla L(\theta_n^{(1)}, \xi_n) \triangleq \theta_n^{(2)}$.
Likewise, we can use $\theta_n^{(2)}$ instead of $\theta_n^{(1)}$, and so on. If we repeat this
argument ad infinitum, then we get the following sequence of improved SGD procedures,

$$\begin{aligned}
\theta_n^s &= \theta_{n-1}^s - \gamma_n \nabla L(\theta_{n-1}^s, \xi_n), \\
&\;\;\vdots \\
\theta_n^s &= \theta_{n-1}^s - \gamma_n \nabla L(\theta_n^{(\infty)}, \xi_n),
\end{aligned} \tag{5}$$

where $\theta_n^{(i)} = \theta_{n-1}^s - \gamma_n \nabla L(\theta_n^{(i-1)}, \xi_n)$, with
initial condition $\theta_n^{(0)} = \theta_{n-1}^s$. In the limit, assuming a unique fixed
point is reached almost surely, the final procedure of sequence (5) satisfies
$\theta_{n-1}^s - \gamma_n \nabla L(\theta_n^{(\infty)}, \xi_n) = \theta_n^{(\infty)}$. This can
be rewritten as $\theta_n^s = \theta_{n-1}^s - \gamma_n \nabla L(\theta_n^s, \xi_n)$, which is
equivalent to implicit SGD. Thus, implicit SGD can be viewed as a repeated application of
classic SGD, where we keep updating the same iterate $\theta_{n-1}^s$ using the same data point
$\xi_n$, until a fixed point is reached. Nesterov's accelerated gradient, a popular improvement
of classic SGD, is only one application of this procedure.
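The fixed-point view can be checked numerically on a toy problem. The sketch below assumes the scalar loss $\frac{1}{2}(y - x\theta)^2$, for which the implicit update has a closed form; the function names and the numbers are hypothetical. In this example the plain iteration converges only because $\gamma x^2 < 1$, whereas the argument above simply assumes that a unique fixed point is reached.

```python
def grad_sq_loss(theta, x, y):
    """Gradient of the illustrative loss 0.5 * (y - x * theta)**2 with respect to theta."""
    return -(y - x * theta) * x

def refine(theta_prev, x, y, gamma, iters):
    """Iterate theta^(i) = theta_prev - gamma * grad L(theta^(i-1), xi), theta^(0) = theta_prev."""
    theta_i = theta_prev
    for _ in range(iters):
        theta_i = theta_prev - gamma * grad_sq_loss(theta_i, x, y)
    return theta_i

def implicit_step(theta_prev, x, y, gamma):
    """Closed-form solution of theta = theta_prev - gamma * grad L(theta, xi) for this loss."""
    return (theta_prev + gamma * x * y) / (1.0 + gamma * x * x)

theta_prev, x, y, gamma = 0.0, 1.5, 3.0, 0.2   # gamma * x**2 = 0.45 < 1
one_step = refine(theta_prev, x, y, gamma, iters=1)   # the classic SGD update
limit = refine(theta_prev, x, y, gamma, iters=200)    # approximates theta^(infinity)
exact = implicit_step(theta_prev, x, y, gamma)        # the implicit SGD update
print(one_step, limit, exact)
```

The one-step result is the classic update, while the limit of the refinements agrees with the implicit update.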

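The stability improvement achieved by implicit updates can be motivated by the following
argument. Assume for simplicity that $L$ is strongly convex, almost surely, with parameter
$\mu > 0$. Then for the implicit SGD procedure (3),

$$\begin{aligned}
\theta_n + \gamma_n \nabla L(\theta_n, \xi_n) &= \theta_{n-1}, \\
\|\theta_n - \theta^\star\|^2 + 2\gamma_n (\theta_n - \theta^\star)^\top \nabla L(\theta_n, \xi_n) &\le \|\theta_{n-1} - \theta^\star\|^2, \\
\|\theta_n - \theta^\star\|^2 &\le \frac{1}{1 + \gamma_n \mu} \|\theta_{n-1} - \theta^\star\|^2,
\end{aligned}$$

which implies that $\|\theta_n - \theta^\star\|^2$ is contracting almost surely. In contrast,
the classic SGD procedure does not share this contracting property.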
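The intermediate steps of this argument can be spelled out as follows. This is a sketch under one extra assumption that the argument leaves implicit, namely $L(\theta_n, \xi_n) \ge L(\theta^\star, \xi_n)$, so that strong convexity yields the inner-product bound used in the last step.

```latex
\begin{align*}
\|\theta_{n-1} - \theta^\star\|^2
  &= \|\theta_n - \theta^\star + \gamma_n \nabla L(\theta_n, \xi_n)\|^2 \\
  &= \|\theta_n - \theta^\star\|^2
     + 2\gamma_n (\theta_n - \theta^\star)^\top \nabla L(\theta_n, \xi_n)
     + \gamma_n^2 \|\nabla L(\theta_n, \xi_n)\|^2 \\
  &\ge \|\theta_n - \theta^\star\|^2
     + 2\gamma_n (\theta_n - \theta^\star)^\top \nabla L(\theta_n, \xi_n), \\
(\theta_n - \theta^\star)^\top \nabla L(\theta_n, \xi_n)
  &\ge L(\theta_n, \xi_n) - L(\theta^\star, \xi_n) + \tfrac{\mu}{2}\|\theta_n - \theta^\star\|^2
   \ge \tfrac{\mu}{2}\|\theta_n - \theta^\star\|^2, \\
\Longrightarrow\quad (1 + \gamma_n \mu)\,\|\theta_n - \theta^\star\|^2
  &\le \|\theta_{n-1} - \theta^\star\|^2 .
\end{align*}
```

The contraction factor $1/(1 + \gamma_n \mu)$ is below one for every positive learning rate, which is why no upper limit on $\gamma_n$ is needed.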

While the implicit updates of Eq. (3) aim to achieve stability, the averaging of the iterates
in Eq. (4) aims to achieve statistical optimality. Ruppert (1988) gave a nice intuition on why
iterate averaging can lead to statistical optimality. When the learning rate is
$\gamma_n \propto n^{-1}$, then $\theta_n - \theta^\star$ is a weighted average of $n$ error
variables $\nabla L(\theta_{i-1}, \xi_i)$, which therefore are significantly autocorrelated.
However, when $\gamma_n \propto n^{-\gamma}$ with $\gamma \in (0, 1)$, then
$\theta_n - \theta^\star$ is the average of $n^{\gamma} \log n$ error variables, which become
uncorrelated in the limit. Thus, averaging improves the estimation accuracy.
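The average in Eq. (4) does not require storing past iterates. A minimal sketch, assuming scalar iterates and a hypothetical helper name `running_average`, uses the recursion $\bar\theta_n = \bar\theta_{n-1} + (\theta_n - \bar\theta_{n-1})/n$:

```python
def running_average(iterates):
    """Yield bar_theta_n = (1/n) * sum_{i<=n} theta_i without storing past iterates."""
    bar = 0.0
    for n, theta in enumerate(iterates, start=1):
        bar += (theta - bar) / n   # equivalent to recomputing the mean of the first n iterates
        yield bar

thetas = [4.0, 2.0, 3.0, 1.0]     # hypothetical iterates theta_1, ..., theta_4
averages = list(running_average(thetas))
print(averages)
```

This keeps the memory cost of averaging at a single extra copy of the parameter vector.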


1.1 Related work 

The implicit update (3) is equivalent to

$$\theta_n = \arg\min_{\theta \in \Theta} \left\{ \frac{1}{2\gamma_n} \|\theta - \theta_{n-1}\|^2 + L(\theta, \xi_n) \right\}. \tag{6}$$

Arguably, the first method that used an update similar to (6) for estimation was the normalized
least-mean squares filter of Nagumo and Noda (1967), used in signal processing. This update is
also used by the incremental proximal method in optimization (Bertsekas, 2011), and has shown
superior performance to classic SGD both in theory and applications (Bertsekas, 2011; Toulis
et al., 2014; Defossez and Bach, 2015; Toulis and Airoldi, 2015). In particular, implicit
updates lead to similar convergence rates as classic SGD updates, but are significantly more
stable. This stability can also be motivated from a Bayesian interpretation of Eq. (6), where
$\theta_n$ is the posterior mode of a model with the multivariate normal
$\mathcal{N}(\theta_{n-1}, \gamma_n I)$ as the prior, $-L(\theta, \cdot)$ as the
log-likelihood, and $\xi_n$ as the observation.
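The equivalence between the implicit update (3) and the minimization problem (6) follows from the first-order optimality condition. The short derivation below is a sketch that assumes $L(\cdot, \xi_n)$ is convex and differentiable and that the minimizer lies in the interior of $\Theta$:

```latex
\begin{align*}
0 &= \nabla_\theta \left\{ \frac{1}{2\gamma_n}\|\theta - \theta_{n-1}\|^2 + L(\theta, \xi_n) \right\}
     \Big|_{\theta = \theta_n}
   = \frac{1}{\gamma_n}(\theta_n - \theta_{n-1}) + \nabla L(\theta_n, \xi_n) \\
\Longleftrightarrow\quad \theta_n &= \theta_{n-1} - \gamma_n \nabla L(\theta_n, \xi_n).
\end{align*}
```

The quadratic term penalizes moving far from $\theta_{n-1}$, with a penalty that weakens as $\gamma_n$ grows.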

A statistical analysis of procedure (3) without averaging was done by Toulis et al. (2014), who
derived the asymptotic variance $\mathrm{Var}(\theta_n)$ of $\theta_n$, and provided an
algorithm to efficiently solve the fixed-point equation (3) for $\theta_n$ in the family of
generalized linear models, which we generalize in this current work. In the online learning
literature, Kivinen et al. (2006) and Kulis and Bartlett (2010) have also analyzed implicit
updates; Schuurmans and Caelli (2007) have further applied implicit procedures to learning with
kernels. Notably, the implicit update (6) is related to the importance weight updates proposed
by Karampatziakis and Langford (2010), but the two update forms are not equivalent, and are
usually combined in practice (Karampatziakis and Langford, 2010, Section 5).

Assuming that the expected loss $\ell$ is known, instead of update (6) we could use the update

$$\theta_n^{+} = \arg\min_{\theta \in \Theta} \left\{ \frac{1}{2\gamma_n} \|\theta - \theta_{n-1}\|^2 + \ell(\theta) \right\}. \tag{7}$$

In optimization, this mapping from $\theta_{n-1}$ to $\theta_n^{+}$ in Eq. (7) is known as a
proximal operator, and is a special instance of the proximal point algorithm (Rockafellar,
1976). Thus implicit SGD involves mappings that are stochastic versions of mappings from
proximal operators. The stochastic proximal gradient algorithm (Singer and Duchi, 2009; Parikh
and Boyd, 2013; Rosasco et al., 2014) is related to, but different from, implicit SGD. In
contrast to implicit SGD, the stochastic proximal gradient algorithm first makes a classic SGD
update (forward step), and then an implicit update (backward step). Only the forward step is
stochastic whereas the backward proximal step is not. This may increase convergence speed but
may also introduce instability due to the forward step.

Interest in proximal operators has surged in recent years because they are non-expansive and
converge with minimal assumptions. Furthermore, they can be applied to non-smooth objectives,
and can easily be combined in modular algorithms for optimization in large-scale and
distributed settings (Parikh and Boyd, 2013). The idea has also been generalized through
splitting algorithms (Lions and Mercier, 1979; Beck and Teboulle, 2009; Singer and Duchi, 2009;
Duchi et al., 2011). Krakowski et al. (2007) and Nemirovski et al. (2009) have shown that
proximal methods can fit better in the geometry of the parameter space $\Theta$, and Toulis and
Airoldi (2014) have made a connection to shrinkage methods in statistics.

Two recent procedures based on stochastic proximal updates are Prox-SVRG (Xiao and Zhang, 2014)
and Prox-SAG (Schmidt et al., 2013, Section 6). The main idea in both methods is to
periodically compute an estimate of the full gradient averaged over all data points in order to
reduce the variance of stochastic gradients. This requires a finite data setting, whereas
AI-SGD also applies to streaming data. Moreover, the periodic calculations in Prox-SVRG are
controlled by additional hyperparameters, and the periodic calculations in Prox-SAG require
storage of the full gradient at every iteration. AI-SGD differs because it employs averaging to
achieve statistical efficiency, has no additional hyperparameters or major storage
requirements, and thus has a simpler implementation.

Averaging of the iterates in Eq. (4) is the other key component of AI-SGD. Averaging was
proposed and analyzed in the stochastic approximation literature by Ruppert (1988) and Bather
(1989). Polyak and Juditsky (1992) substantially expanded the scope of the averaging method by
proving asymptotic optimality of the classic SGD procedure with averaging, under suitable
assumptions. Their results showed clearly that slowly-convergent stochastic approximations
(achieved when the learning rates are large) need to be averaged. Recent work has analyzed
classic SGD with averaging (Zhang, 2004; Xu, 2011; Shamir and Zhang, 2012; Bach and Moulines,
2013) and has shown its superiority in numerous learning tasks.

1.2 Overview of results 

In this paper, we study the iterates $\theta_n$ and use the results to study $\bar\theta_n$ as
an estimator of $\theta^\star$. Under strong convexity of the expected loss, we derive upper
bounds for the squared errors $\mathbb{E}(\|\theta_n - \theta^\star\|^2)$ and
$\mathbb{E}(\|\bar\theta_n - \theta^\star\|^2)$ in Theorem 1 and Theorem 2, respectively. In the
supplementary material, we also give bounds for $\mathbb{E}(\|\theta_n - \theta^\star\|^4)$.

Two main results are derived from our theoretical analysis. First, $\bar\theta_n$ achieves the
Cramér-Rao bound, i.e., no other unbiased estimator of $\theta^\star$ can do better in the
limit, which is equivalent to the optimal $O(1/n)$ rate of convergence for first-order
procedures. Second, AI-SGD is significantly more stable to misspecification of the learning
rate relative to classic averaged SGD procedures, with respect to the learning problem
parameters, e.g., convexity and Lipschitz constants. Finally, we perform experiments on several
standard machine learning tasks, which show that AI-SGD comes closer to combining stability,
optimality, and simplicity than other competing methods.

2 Preliminaries 

Notation. Let $\mathcal{F}_n = \{\theta_0, \xi_1, \xi_2, \ldots, \xi_n\}$ denote the filtration
that process $\theta_n$ (3) is adapted to. The norm $\|\cdot\|$ will denote the $L_2$ norm. The
symbol $\triangleq$ indicates a definition, and the symbol $\overset{\text{def}}{=}$ denotes
"equal by definition". For example, $x \triangleq y$ defines $x$ as equal to known variable
$y$, whereas $x \overset{\text{def}}{=} y$ denotes that $x$ is equal to $y$ by definition. We
will not use this formalism when defining constants. For two positive sequences $a_n$, $b_n$,
we write $b_n = O(a_n)$ if there exists a fixed $c > 0$ such that $b_n \le c a_n$, for all $n$;
also, $b_n = o(a_n)$ if $b_n / a_n \to 0$. When a positive scalar sequence $a_n$ is
monotonically decreasing to zero, we write $a_n \downarrow 0$. Similarly, for a sequence $X_n$
of vectors or matrices, $X_n = O(a_n)$ denotes that $\|X_n\| = O(a_n)$, and $X_n = o(a_n)$
denotes that $\|X_n\| = o(a_n)$. For two matrices $A$, $B$, $A \preceq B$ denotes that $B - A$
is nonnegative-definite; $\mathrm{tr}(A)$ denotes the trace of $A$.

We now introduce the main assumptions pertaining to the 
theory of this paper. 

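Assumption 1. The loss function $L(\theta, \xi)$ is almost-surely differentiable. The random
vector $\xi$ can be decomposed as $\xi = (x, y)$, $x \in \mathbb{R}^p$, $y \in \mathbb{R}^d$,
such that

$$L(\theta, \xi) \equiv L(x^\top \theta, y). \tag{8}$$

Assumption 2. The learning rate sequence $\{\gamma_n\}$ is defined as
$\gamma_n = \gamma_1 n^{-\gamma}$, where $\gamma_1 > 0$ and $\gamma \in (1/2, 1]$.

Assumption 3 (Lipschitz conditions). For all $\theta_1, \theta_2 \in \Theta$, a combination of
the following conditions is satisfied almost-surely:

(a) The loss function $L$ is Lipschitz-continuous with parameter $\lambda_0$, i.e.,

$$|L(\theta_1, \xi) - L(\theta_2, \xi)| \le \lambda_0 \|\theta_1 - \theta_2\|,$$

(b) The map $\nabla L$ is Lipschitz-continuous with parameter $\lambda_1$, i.e.,

$$\|\nabla L(\theta_1, \xi) - \nabla L(\theta_2, \xi)\| \le \lambda_1 \|\theta_1 - \theta_2\|,$$

(c) The map $\nabla^2 L$ is Lipschitz-continuous with parameter $\lambda_2$, i.e.,

$$\|\nabla^2 L(\theta_1, \xi) - \nabla^2 L(\theta_2, \xi)\| \le \lambda_2 \|\theta_1 - \theta_2\|.$$

Assumption 4. The observed Fisher information matrix,
$\hat{\mathcal{I}}(\theta) \triangleq \nabla^2 L(\theta, \xi)$, has non-vanishing trace, i.e.,
there exists $\phi > 0$ such that $\mathrm{tr}(\hat{\mathcal{I}}(\theta)) \ge \phi$,
almost-surely, for all $\theta \in \Theta$. The expected Fisher information matrix,
$\mathcal{I}(\theta) \triangleq \mathbb{E}(\hat{\mathcal{I}}(\theta))$, has minimum eigenvalue
$0 < \lambda_f \le \phi$, for all $\theta \in \Theta$.

Assumption 5. The zero-mean random variable
$W_\theta \triangleq \nabla L(\theta, \xi) - \nabla \ell(\theta)$ is square-integrable, such
that, for a fixed positive-definite $\Sigma$,

$$\mathbb{E}\left(W_{\theta^\star} W_{\theta^\star}^\top\right) \preceq \Sigma.$$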

Remarks. Assumption 1 puts a constraint on the loss function, but it is not very restrictive
because the majority of machine learning models indeed depend on the parameter $\theta$ through
a linear combination with features. A notable exception includes loss functions with a
regularization term. Although it is easy to add regularization to AI-SGD, we will not do so in
this paper because AI-SGD works well without it, since the proximal operator (6) already
regularizes the estimate $\theta_n$ towards $\theta_{n-1}$. In experiments, regularization
neither improved nor worsened AI-SGD (see supplementary material for more details).
Assumption 2 on learning rates and Assumption 5 are standard in the literature of stochastic
approximations, dating back to the original paper of Robbins and Monro (1951) in the
one-dimensional parameter case.

Assumptions on Lipschitz gradients (Assumption 3(b), Assumption 3(c)) can be relaxed; for
example, Benveniste et al. (1990) relax this assumption using a condition in terms of
$\|\theta_1 - \theta_2\|$. However, these two Lipschitz conditions are commonly used in order to
simplify the non-asymptotic analysis (Moulines and Bach, 2011). Assumption 3(a) is less
standard in classic SGD literature but has so far been standard in the limited literature on
implicit SGD (Bertsekas, 2011). We can forgo this assumption and still maintain identical rates
for the errors, although at the expense of a more complicated analysis. It is also an open
problem whether a nice stability result similar to Theorem 1 can be derived under
Assumption 3(b) instead of Assumption 3(a). We discuss this issue after the proof of Theorem 1
in the supplementary material.

Assumption 4 makes two claims. The first claim on the observed Fisher information matrix is a
relaxed form of strong convexity for the loss $L(\theta, \xi)$. However, in contrast to strong
convexity, this claim allows several eigenvalues of $\hat{\mathcal{I}}(\theta)$ to be zero. The
second claim of Assumption 4 is equivalent to strong convexity of the expected loss
$\ell(\theta)$. From a statistical perspective, strong convexity posits that there is
information in the data for all elements of $\theta^\star$. This assumption is necessary to
derive bounds on the errors $\mathbb{E}(\|\theta_n - \theta^\star\|^2)$, and has been used to
show optimality of classic SGD with averaging (Polyak and Juditsky, 1992; Ljung et al., 1992;
Xu, 2011; Moulines and Bach, 2011).

Overall, our assumptions are weaker than the assumptions in the limited literature on implicit
SGD. For example, Bertsekas (2011, Assumptions 3.1, 3.2) assumes almost-sure bounded gradients
$\nabla L(\theta, \xi)$ in addition to Assumption 3(a); Ryu and Boyd (2014) assume strong
convexity of $L(\theta, \xi)$, in expectation, which can simplify the analysis significantly.
We discuss more details in the supplementary material after the proof of Theorem 1.

3 Theory 

In this section we present our theoretical analysis of AI-SGD. All proofs are given in the
supplementary material. The main technical challenge in analyzing implicit SGD (3) is that,
unlike typical analysis with classic SGD (2), the error $\xi_n$ is not conditionally
independent of $\theta_n$. This implies that
$\mathbb{E}(\nabla L(\theta_n, \xi_n) \mid \theta_n) \ne \nabla \ell(\theta_n)$, which makes it
no longer possible to use the convexity properties of $\ell$ to analyze the errors
$\mathbb{E}(\|\theta_n - \theta^\star\|^2)$, as is common in the literature.

As mentioned earlier, to circumvent this issue other authors have made strict almost-sure
assumptions on the implicit procedure (3) (Bertsekas, 2011; Ryu and Boyd, 2014). In this paper,
we rely on weaker conditions, namely the Lipschitz assumptions 3(a)-3(c), which are also used in
non-implicit procedures. Our proof strategy relies on a master lemma (Lemma 3 in supplementary
material) for the analysis of recursions that appear to be typical in implicit procedures. This
result is novel to the best of our knowledge, and it can be useful in future research on
implicit procedures.

3.1 Computational efficiency 

Our first result enables efficient computation of the implicit update (3). In general, this can
be expensive due to solving a fixed-point equation in many dimensions, at every iteration. We
reduce this multi-dimensional equation to an equation of only one dimension. Furthermore, under
almost-sure convexity of the loss function, efficient search bounds for the one-dimensional
fixed-point equation are available. This result generalizes an earlier result on efficient
computation of implicit updates in generalized linear models (Toulis et al., 2014,
Algorithm 1).

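Definition 1. Suppose that Assumption 1 holds. For observation $\xi = (x, y)$, the first
derivative with respect to the natural parameter $x^\top \theta$ is denoted by
$L'(\theta, \xi)$, and is defined as

$$L'(\theta, \xi) \triangleq \frac{\partial L(\theta, \xi)}{\partial (x^\top \theta)} \overset{\text{def}}{=} \frac{\partial L(x^\top \theta, y)}{\partial (x^\top \theta)}. \tag{9}$$

Similarly, $L''(\theta, \xi) \triangleq \dfrac{\partial L'(\theta, \xi)}{\partial (x^\top \theta)}$.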

Lemma 1. Suppose that Assumption 1 holds, and consider functions $L'$, $L''$ from Definition 1.
Then, almost-surely,

$$\nabla L(\theta_n, \xi_n) = s_n \nabla L(\theta_{n-1}, \xi_n); \tag{10}$$






Panos Toulis, Dustin Tran, Edoardo M. Airoldi 


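the scalar $s_n$ satisfies the fixed-point equation,

$$s_n \kappa_{n-1} = L'\left(\theta_{n-1} - s_n \gamma_n \nabla L(\theta_{n-1}, \xi_n), \xi_n\right), \tag{11}$$

where $\kappa_{n-1} \triangleq L'(\theta_{n-1}, \xi_n)$. Moreover, if $L''(\theta, \xi) \ge 0$
almost-surely for all $\theta \in \Theta$, then

$$s_n \kappa_{n-1} \in \begin{cases} [\kappa_{n-1}, 0) & \text{if } \kappa_{n-1} < 0, \\ [0, \kappa_{n-1}] & \text{otherwise.} \end{cases}$$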

Remarks. Lemma 1 has two parts. First, it shows that the implicit update can be performed by
obtaining $s_n$ from the fixed-point Eq. (11), and then using
$\nabla L(\theta_n, \xi_n) = s_n \nabla L(\theta_{n-1}, \xi_n)$ in the implicit update (3). The
fixed-point equation can be solved through a numerical root-finding procedure (Kivinen et al.,
2006; Kulis and Bartlett, 2010; Toulis et al., 2014). Second, when the loss function is convex,
narrow search bounds for $s_n$ are available. This property holds, for example, when the loss
function is the negative log-likelihood in an exponential family.
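The one-dimensional reduction can be sketched for logistic regression. The code below is an illustration under assumptions not fixed by the paper: the loss is the logistic negative log-likelihood with $y \in \{0, 1\}$, so that $L'(\theta, \xi) = \sigma(x^\top\theta) - y$; the root is found by plain bisection; and the function names and the numbers in the usage lines are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def implicit_logistic_step(theta_prev, x, y, gamma, tol=1e-12):
    """One implicit update (3) for logistic loss, using a scalar root search."""
    eta_prev = sum(xi * ti for xi, ti in zip(x, theta_prev))
    sq_norm = sum(xi * xi for xi in x)
    kappa = sigmoid(eta_prev) - y              # kappa_{n-1} = L'(theta_{n-1}, xi_n)

    def residual(v):
        # v plays the role of s_n * kappa_{n-1}; the residual is increasing in v
        return v - (sigmoid(eta_prev - gamma * sq_norm * v) - y)

    lo, hi = min(kappa, 0.0), max(kappa, 0.0)  # search bounds for a convex loss
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    v = 0.5 * (lo + hi)
    # grad L(theta_n, xi_n) = v * x, so the update is theta_{n-1} - gamma * v * x
    return [ti - gamma * v * xi for ti, xi in zip(theta_prev, x)]

theta_prev, x, y, gamma = [0.0, 0.0], [1.0, 2.0], 1, 10.0
theta_new = implicit_logistic_step(theta_prev, x, y, gamma)
print(theta_new)
```

The search interval has length at most $|\kappa_{n-1}| \le 1$ here, so the number of bisection steps does not depend on the dimension of $\theta$.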

3.2 Non-asymptotic analysis 

Our next result is on the mean-squared errors $\mathbb{E}(\|\theta_n - \theta^\star\|^2)$.
These errors show the stability and convergence rates of implicit SGD and are used in
combination with bounds on errors $\mathbb{E}(\|\theta_n - \theta^\star\|^4)$ to derive bounds
on the errors $\mathbb{E}(\|\bar\theta_n - \theta^\star\|^2)$ of the averaged procedure.*

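Theorem 1. Suppose that Assumptions 1, 2, 3(a), and 4 hold. Define
$\delta_n \triangleq \mathbb{E}(\|\theta_n - \theta^\star\|^2)$, and constants
$\Gamma^2 = 4\lambda_0^2 \sum_i \gamma_i^2 < \infty$,
$\epsilon = (1 + \gamma_1(\phi - \lambda_f))^{-1}$, and
$\lambda = 1 + \gamma_1 \lambda_f \epsilon$. Then, there exists a constant $n_0 > 0$ such that,
for all $n > 0$,

$$\delta_n \le \frac{8 \lambda_0^2 \gamma_1 \lambda}{\lambda_f \epsilon} n^{-\gamma} + e^{-\log\lambda \cdot n^{1-\gamma}} \left[\delta_0 + \lambda^{n_0} \Gamma^2\right].$$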

Remarks. According to Theorem 1, the convergence rate of the implicit iterates $\theta_n$ is
$O(n^{-\gamma})$. This matches earlier results on rates of classic SGD (Benveniste et al.,
1990; Moulines and Bach, 2011). The most important difference, however, is that the implicit
procedure discounts the initial conditions $\delta_0$ at an exponential rate, regardless of the
specification of the learning rate. As shown by Moulines and Bach (2011, Theorem 1), in classic
SGD there exists a term $\exp(\lambda_1^2 \gamma_1^2 n^{1-2\gamma})$ in front of the initial
conditions, which can be catastrophic if the learning rate parameter $\gamma_1$ is
misspecified. In contrast, the implicit iterates are unconditionally stable, i.e., any
specification of the learning rate will lead to a stable discounting of the initial conditions.


* The bounds for the fourth moments $\mathbb{E}(\|\theta_n - \theta^\star\|^4)$ are given in
the supplementary material because they rely on the same intermediate results as
$\mathbb{E}(\|\theta_n - \theta^\star\|^2)$.


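Theorem 2. Consider the AI-SGD procedure (4), and suppose that Assumptions 2, 3(a), 3(c), 4,
and 5 hold. Then,

$$\left(\mathbb{E}(\|\bar\theta_n - \theta^\star\|^2)\right)^{1/2} \le \frac{1}{\sqrt{n}} \left(\mathrm{tr}\left(\nabla^2 \ell(\theta^\star)^{-1} \Sigma \nabla^2 \ell(\theta^\star)^{-1}\right)\right)^{1/2} + O(n^{-1+\gamma/2}) + O(n^{-\gamma}) + O\left(\exp(-\log\lambda \cdot n^{1-\gamma}/2)\right).$$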

Remarks. The full version of Theorem 2, which includes all constants, is given in the
supplementary material. Even in its shortened form, Theorem 2 delivers three main results.
First, the iterates $\bar\theta_n$ attain the Cramér-Rao lower bound, i.e., any other unbiased
estimator of $\theta^\star$ cannot have lower MSE than $\bar\theta_n$. From an optimization
perspective, $\bar\theta_n$ attains the rate $O(1/n)$, which is optimal for first-order methods
(Nesterov, 2004). This result matches the asymptotic optimality of averaged iterates from
classic SGD procedures, which has been proven by Polyak and Juditsky (1992).

Second, the remaining rates of the root-mean-squared error are $O(n^{-1+\gamma/2})$ and
$O(n^{-\gamma})$, i.e., $O(n^{-2+\gamma})$ and $O(n^{-2\gamma})$ for the squared error.
Balancing the two implies the optimal choice $\gamma = 2/3$ for the exponent of the learning
rate. This extends the results of Ruppert (1988), and more recently of Xu (2011) and Moulines
and Bach (2011), on optimal exponents for classic SGD procedures.
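The choice $\gamma = 2/3$ can be checked with a one-line calculation. The sketch below assumes that the two non-leading polynomial terms in the root-mean-squared error bound are $O(n^{-1+\gamma/2})$ and $O(n^{-\gamma})$; the first decays faster for smaller $\gamma$ and the second for larger $\gamma$, so the slower of the two is fastest when the exponents coincide.

```latex
\begin{align*}
-1 + \tfrac{\gamma}{2} = -\gamma
  \;\Longleftrightarrow\; \tfrac{3}{2}\gamma = 1
  \;\Longleftrightarrow\; \gamma = \tfrac{2}{3},
\qquad\text{so that}\qquad
O(n^{-1+\gamma/2}) = O(n^{-\gamma}) = O(n^{-2/3}).
\end{align*}
```

This value lies inside the range $\gamma \in (1/2, 1]$ allowed for the learning rate exponent, and both terms remain of lower order than the leading $n^{-1/2}$ term.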

Third, as with non-averaged implicit iterates in Theorem 1, the averaged iterates
$\bar\theta_n$ have a decay of the initial conditions regardless of the specification of the
learning rate parameter. This stability property is inherited from the underlying implicit SGD
procedure (3) that is being averaged. In contrast, averaged iterates of classic SGD procedures
can diverge numerically because arbitrarily large terms can appear in front of initial
conditions (Moulines and Bach, 2011, Theorem 3).

4 Experiments 

In this section, we show that AI-SGD achieves comparable, and sometimes superior, results to
other methods while combining statistical efficiency, stability, and simplicity. In our
experiments, we compare our procedure to the following procedures:

• SGD: Classic stochastic gradient descent in its standard formulation (Sakrison, 1965; Zhang,
2004), which employs the update
$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_{n-1}, \xi_n)$.

• Implicit SGD: Stochastic gradient descent procedure introduced in Toulis et al. (2014), which
employs the implicit update (3) without averaging. It is robust to misspecification of the
learning rate but also exhibits slower convergence in practice relative to classic SGD.

• ASGD: Averaged stochastic gradient descent procedure with classic updates of the iterates
(Xu, 2011; Shamir and Zhang, 2012; Bach and Moulines, 2013). This is equivalent to AI-SGD where
the update (3) is replaced by the classic step
$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_{n-1}, \xi_n)$.

• Prox-SVRG: A proximal version of the stochastic gradient descent procedure with progressive
variance reduction (SVRG) (Xiao and Zhang, 2014).

• Prox-SAG: A proximal version of the stochastic average gradient (SAG) procedure (Schmidt
et al., 2013). While its theory has not been formally established, Prox-SAG has shown similar
convergence properties to Prox-SVRG in practice.

• AdaGrad: A stochastic gradient descent procedure with a form of diagonal scaling to adapt the
learning rate (Duchi et al., 2011).

Note that Prox-SVRG and Prox-SAG are applicable only to fixed data sets and not to the
streaming setting. Therefore the theoretical linear convergence rate of these methods refers to
convergence to an empirical minimizer (e.g., maximum likelihood, or maximum a posteriori if
there is regularization), and not to the ground truth $\theta^\star$. On the other hand, AI-SGD
can be applied to both data settings.

We also note that AdaGrad and similar adaptive schedules (Tieleman and Hinton, 2012; Kingma and
Ba, 2015) effectively approximate the natural gradient by using a multi-dimensional learning
rate. These learning rates have the added advantage of being less sensitive than
one-dimensional rates to tuning of hyperparameters; they can be combined in practice with
AI-SGD.

4.1 Statistical efficiency and stability 

We first demonstrate the theoretical results on the stability and statistical optimality of
AI-SGD. To do so, we follow a simple normal linear regression example from Bach and Moulines
(2013). Let $N = 10^6$ be the number of observations, and $p = 20$ be the number of features.
Let $\theta^\star = (0, 0, \ldots, 0)^\top$ be the ground truth. The random variable $\xi$ is
decomposed as $\xi_n = (x_n, y_n)$, where the feature vectors
$x_1, \ldots, x_N \sim \mathcal{N}_p(0, H)$ are i.i.d. normal random variables, and $H$ is a
randomly generated symmetric matrix with eigenvalues $1/k$, for $k = 1, \ldots, p$. The outcome
$y_n$ is sampled from a normal distribution as
$y_n \mid x_n \sim \mathcal{N}(x_n^\top \theta^\star, 1)$, for $n = 1, \ldots, N$. Our loss
function is defined as the squared residual, i.e.,
$L(\theta, \xi_n) = (y_n - x_n^\top \theta)^2$, and thus, up to an additive constant,
$\ell(\theta) = \mathbb{E}(L(\theta, \xi)) = (\theta - \theta^\star)^\top H (\theta - \theta^\star)$.

We choose a constant learning rate $\gamma_n \equiv \gamma$ according to the average radius of
the data $R^2 = \mathrm{trace}(H)$, and for both ASGD and AI-SGD we collect iterates
$\theta_n$, $n = 1, \ldots, N$, and keep the average $\bar\theta_n$. In Figure 1, we plot
$\ell(\bar\theta_n)$ for each iteration for a maximum of $N$ iterations in log-log space.
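A scaled-down version of this experiment can be sketched in pure Python. The following is an illustration, not the authors' experimental code, and it makes several simplifying assumptions: $H$ is taken to be diagonal with eigenvalues $1/k$, the problem is much smaller ($p = 5$ and 2000 observations), the starting point is $\theta_0 = (1, \ldots, 1)$, and the learning rate is $\gamma = \texttt{scale} / R^2$ with hypothetical values of `scale`. For the squared-residual loss the implicit update has a closed form, which the sketch uses.

```python
import math
import random

def simulate(implicit, scale, n_obs=2000, p=5, seed=1):
    """Return the excess loss bar_theta^T H bar_theta after one pass (theta* = 0)."""
    rng = random.Random(seed)
    eig = [1.0 / k for k in range(1, p + 1)]   # eigenvalues of the (diagonal) H
    gamma = scale / sum(eig)                   # R^2 = trace(H)
    theta = [1.0] * p
    bar = [0.0] * p
    for n in range(1, n_obs + 1):
        x = [math.sqrt(e) * rng.gauss(0.0, 1.0) for e in eig]
        y = rng.gauss(0.0, 1.0)                # y | x ~ N(x^T theta*, 1) with theta* = 0
        resid = y - sum(xi * ti for xi, ti in zip(x, theta))
        step = 2.0 * gamma * resid             # classic step for L = (y - x^T theta)^2
        if implicit:
            # closed-form solution of the implicit update (3) for this loss
            step /= 1.0 + 2.0 * gamma * sum(xi * xi for xi in x)
        theta = [ti + step * xi for ti, xi in zip(theta, x)]
        if not all(math.isfinite(ti) for ti in theta):
            return math.inf                    # the iterates diverged
        bar = [bi + (ti - bi) / n for bi, ti in zip(bar, theta)]
    return sum(e * bi * bi for e, bi in zip(eig, bar))

for scale in (0.25, 10.0):
    print(scale, simulate(True, scale), simulate(False, scale))
```

With the small rate both averaged procedures should reach a small excess loss, whereas with the large rate only the implicit variant is expected to remain bounded.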



Figure 1: Loss of AI-SGD, ASGD, and implicit SGD, on simulated multivariate normal data with
$N = 10^6$ observations, $p = 20$ features. The plot shows that AI-SGD achieves stability
regardless of the specification of the learning rate $\gamma_n \equiv \gamma$. In contrast,
ASGD diverges when the learning rate is only slightly misspecified (e.g., solid, blue line).

Figure 1 shows that AI-SGD performs on par with ASGD for the rates at which ASGD is known to be
optimal. However, the benefit of the implicit procedure (3) in AI-SGD becomes clear as the
learning rate increases. Notably, AI-SGD remains stable for learning rates that are above the
theoretical threshold, i.e., when $\gamma > 1/R^2$, whereas ASGD diverges above that threshold,
e.g., when $\gamma = 2/R^2$. This stable behavior is also exhibited in implicit SGD, but
implicit SGD converges at a slower rate than AI-SGD, and thus does not combine stability with
statistical efficiency. This behavior is also reflected for AI-SGD when using decaying learning
rates, e.g., $\gamma_n \propto 1/n$.

4.2 Classification error 

We now conduct a study of AI-SGD's empirical performance on standard benchmarks of large-scale
linear classification. For brevity, we display results on four data sets, although we have seen
similar results on eight additional ones (see the supplementary material for more details).

Table 1 displays a summary of the data sets. The COVTYPE data set (Blackard, 1998) consists of
forest cover types in which the task is to classify class 2 among 7 forest cover types. DELTA
is synthetic data offered in the PASCAL Large Scale Challenge (Sonnenburg et al., 2008) and we
apply the default processing offered by the challenge organizers. The task in RCV1 is to
classify documents belonging to class CCAT in the text dataset (Lewis et al., 2004), where we
apply the standard preprocessing provided by Bottou (2012). In the MNIST data set (Le Cun
et al., 1998) of images of handwritten digits, the task is to classify digit 9 against all
others.

data set   description            type     features   training set   test set   $\lambda$
covtype    forest cover type      sparse   54         464,809        116,203    $10^{-6}$
delta      synthetic data         dense    500        450,000        50,000     $10^{-2}$
rcv1       text data              sparse   47,152     781,265        23,149     $10^{-5}$
mnist      digit image features   dense    784        60,000         10,000     $10^{-3}$

Table 1: Summary of data sets and the $L_2$ regularization parameter $\lambda$, following the
settings in Xu (2011).

For AI-SGD and ASGD, we use the learning rate $\gamma_n = \eta_0 (1 + \eta_0 n)^{-3/4}$
prescribed in Xu (2011), where the constant $\eta_0$ is determined through preprocessing on a
small subset of the data. Hyperparameters for other methods are set based on a computationally
intensive grid search over the entire hyperparameter space; this includes step sizes for
Prox-SAG, Prox-SVRG, and AdaGrad, and the inner iteration count for Prox-SVRG. For all methods
we use $L_2$ regularization with parameter $\lambda$, which varies for each data set, and which
is also used in Xu (2011).

Figure 2: Large scale linear classification with log loss on four data sets. Each plot
indicates the test error of various stochastic gradient methods over a single pass of the data.

The results are shown in Figure 2. We see that AI-SGD achieves comparable performance with the
tuned proximal methods Prox-SVRG and Prox-SAG, as well as AdaGrad. All methods have a
comparable convergence rate and take roughly a single pass in order to converge. Interestingly,
AdaGrad exhibits a larger variance in its estimate than the proximal methods. This comes from
the lesser-known fact that the learning rate in AdaGrad is a suboptimal approximation of the
Fisher information, and hence it is statistically inefficient.

4.3 Sensitivity analysis

We examine the inherent stability of the aforementioned procedures by perturbing their
hyperparameters. That is, we perform sensitivity analysis by varying any hyperparameters that
the user must tweak in order to fine-tune the convergence of each procedure. We do so for
hyperparameters in ASGD (the learning rate), Prox-SVRG (proximal step size $\eta$ and inner
iteration $m$), and AI-SGD (the learning rate).

Figure 3: Top: Logistic regression on the RCV1 dataset, performing sensitivity analysis of
AI-SGD and ASGD for the choice of regularization parameter $\lambda$. Bottom: linear SVM on the
covtype dataset, performing sensitivity analysis of AI-SGD and Prox-SVRG, in which Prox-SVRG
has additional hyperparameters $\eta$ according to the step size of the proximal update and $m$
according to the inner iteration count.

The results are shown in Figure 3. When we decrease the regularization parameter, ASGD performs
increasingly worse. While it may converge, the test error can be arbitrarily large. On the
other hand, AI-SGD always achieves convergence and is not affected by the choice of the
hyperparameter. When the regularization parameter is about $1/N$, e.g., when
$\lambda < 10^{-6}$, ASGD remains stable and achieves the same performance as AI-SGD. Similar
results hold when perturbing the hyperparameters $\eta$ and $m$ in Prox-SVRG, whereas AI-SGD
does not require specification of such hyperparameters.

5 Conclusion 

We propose a statistical learning procedure, termed AI-SGD, and investigate its theoretical and empirical properties. AI-SGD combines simple stochastic proximal steps, also known as implicit updates, with iterate averaging and larger step sizes. The proximal steps allow AI-SGD to be significantly more stable than classic SGD procedures, with or without averaging of the iterates; this stability comes at virtually no computational cost for a large family of machine learning models. Furthermore, the averaging of the iterates leads AI-SGD to be statistically optimal, i.e., the variance of the iterate of AI-SGD achieves the Cramér-Rao lower bound under strong convexity. Last but not least, AI-SGD is as simple to implement as classic SGD. In comparison, other stochastic proximal procedures, such as PROX-SVRG or PROX-SAG, require tuning of hyperparameters that control periodic calculations over the entire dataset, and possibly storage of the full gradient.
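The stability claim can be illustrated on a toy problem. The sketch below is not one of the paper's experiments: it assumes a least-squares loss, synthetic Gaussian data, and a deliberately large initial learning rate (all hypothetical choices), and compares a few explicit SGD steps with the corresponding implicit steps, which have a closed form for this loss.

```python
import numpy as np

def run_sgd(X, y, gamma1, gamma, implicit):
    """Explicit or implicit SGD on the least-squares loss 0.5 * (y - x'theta)^2."""
    n, p = X.shape
    theta = np.zeros(p)
    for i in range(n):
        x_i, y_i = X[i], y[i]
        g = gamma1 * (i + 1) ** (-gamma)   # learning rate gamma_n = gamma1 * n^(-gamma)
        resid = y_i - x_i @ theta          # residual at the previous iterate
        if implicit:
            # Closed-form solution of theta = theta_prev + g * (y - x'theta) * x.
            step = g / (1.0 + g * (x_i @ x_i))
        else:
            step = g                       # classic (explicit) update
        theta = theta + step * resid * x_i
    return theta

# Hypothetical toy data: 30 observations, 20 features, very small noise.
rng = np.random.default_rng(0)
p, n = 20, 30
theta_star = np.ones(p)
X = rng.normal(size=(n, p))
y = X @ theta_star + 0.01 * rng.normal(size=n)

initial_err = np.linalg.norm(theta_star)   # both runs start from theta = 0
explicit_err = np.linalg.norm(run_sgd(X, y, 5.0, 0.6, implicit=False) - theta_star)
implicit_err = np.linalg.norm(run_sgd(X, y, 5.0, 0.6, implicit=True) - theta_star)
print(f"initial={initial_err:.3g} explicit={explicit_err:.3g} implicit={implicit_err:.3g}")
```

With the learning rate misspecified in this way, the explicit iterates are amplified at every step, whereas each implicit step shrinks the error along the observed feature direction.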
A Note

Lemmas 1, 2, 3 and 4, and Corollary 1, were originally derived by Toulis and Airoldi (2014). These intermediate results (and Theorem 1) provide the necessary foundation to derive Lemma 5 (only in this supplement) and Theorem 2 on the asymptotic optimality of AI-SGD, which is the key result of the main paper. We fully state these intermediate results here for convenience, but we point the reader to the aforementioned reference for the proofs and for more details on the theory of (non-averaged) implicit stochastic gradient descent (implicit SGD).

B Introduction 

Consider a random variable $\xi \in \Xi$, a parameter space $\Theta$ that is convex and compact, and a loss function $L : \Theta \times \Xi \to \mathbb{R}$. We wish to solve the following stochastic optimization problem:

$$\theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}\left(L(\theta, \xi)\right), \tag{12}$$

where the expectation is with respect to $\xi$. Define the expected loss,

$$\ell(\theta) = \mathbb{E}\left(L(\theta, \xi)\right), \tag{13}$$

where $L$ is differentiable almost-surely. In this work we study a stochastic approximation procedure to solve (12) defined through the iterations

$$\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_n, \xi_n), \quad \theta_0 \in \Theta, \tag{14}$$

$$\bar{\theta}_n = \frac{1}{n} \sum_{i=1}^{n} \theta_i, \tag{15}$$

where $\{\xi_1, \xi_2, \ldots\}$ are i.i.d. realizations of $\xi$, and $\nabla L(\theta, \xi_n)$ is the gradient of the loss function with respect to $\theta$ given realized value $\xi_n$. The sequence $\{\gamma_n\}$ is a non-increasing sequence of positive real numbers. We will refer to the procedure defined by (14) and (15) as averaged implicit stochastic gradient descent, or AI-SGD for short. Procedure AI-SGD combines two ideas, namely an implicit update in Eq. (14), as $\theta_n$ appears on both sides of the update, and averaging of the iterates $\theta_n$ in Eq. (15).
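As an illustration of iterations (14) and (15), the following minimal sketch runs AI-SGD for a least-squares loss, for which the implicit update can be solved in closed form. The loss, the function name `ai_sgd_least_squares`, and the default values of `gamma1` and `gamma` are assumptions made for this example only.

```python
import numpy as np

def ai_sgd_least_squares(X, y, gamma1=1.0, gamma=0.6):
    """AI-SGD sketch for the loss 0.5 * (y - x'theta)^2 (an assumed example loss).

    Returns the last implicit iterate theta_n and the averaged iterate theta_bar_n.
    """
    n, p = X.shape
    theta = np.zeros(p)        # theta_0
    theta_bar = np.zeros(p)    # running average of theta_1, ..., theta_n
    for i in range(n):
        x_i, y_i = X[i], y[i]
        g = gamma1 * (i + 1) ** (-gamma)   # gamma_n = gamma1 * n^(-gamma)
        resid = y_i - x_i @ theta
        # Implicit update (14): theta_n = theta_{n-1} + g * (y - x'theta_n) * x.
        # Solving for theta_n gives the shrunken step g / (1 + g * ||x||^2).
        theta = theta + (g / (1.0 + g * (x_i @ x_i))) * resid * x_i
        # Averaging (15), computed as a running mean.
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta, theta_bar
```

For example, on synthetic data generated as `y = X @ theta_star + noise`, a single pass `ai_sgd_least_squares(X, y)` returns both iterates, and the averaged one can be compared with `theta_star`.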

C Notation and assumptions 

Let $\mathcal{F}_n = \{\theta_0, \xi_1, \xi_2, \ldots, \xi_n\}$ denote the filtration that process $\theta_n$ (14) is adapted to. The norm $\|\cdot\|$ will denote the $L_2$ norm. The symbol $\triangleq$ indicates a definition, and the symbol $\overset{\text{def}}{=}$ denotes "equal by definition". For example, $x \triangleq y$ defines $x$ as equal to known variable $y$, whereas $x \overset{\text{def}}{=} y$ denotes that $x$ is equal to $y$ by definition. We will not use this formalism when defining constants. For two positive sequences $a_n, b_n$, we write $b_n = O(a_n)$ if there exists a fixed $c > 0$ such that $b_n \le c a_n$, for all $n$; also, $b_n = o(a_n)$ if $b_n / a_n \to 0$. When a positive scalar sequence $a_n$ is monotonically decreasing to zero, we write $a_n \downarrow 0$. Similarly, for a sequence $X_n$ of vectors or matrices, $X_n = O(a_n)$ denotes that $\|X_n\| = O(a_n)$, and $X_n = o(a_n)$ denotes that $\|X_n\| = o(a_n)$. For two matrices $A, B$, $A \preceq B$ denotes that $B - A$ is nonnegative-definite; $\mathrm{tr}(A)$ denotes the trace of $A$.

We now introduce the main assumptions pertaining to the theory of this paper. 

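Assumption 6. The loss function $L(\theta, \xi)$ is almost-surely differentiable. The random vector $\xi$ can be decomposed as $\xi = (x, y)$, $x \in \mathbb{R}^p$, $y \in \mathbb{R}^d$, such that

$$L(\theta, \xi) = L(x^\top \theta, y). \tag{16}$$

Assumption 7. The learning rate sequence $\{\gamma_n\}$ is defined as $\gamma_n = \gamma_1 n^{-\gamma}$, where $\gamma_1 > 0$ and $\gamma \in (1/2, 1]$.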

Assumption 8 (Lipschitz conditions). For all $\theta_1, \theta_2 \in \Theta$, a combination of the following conditions is satisfied almost-surely:

(a) The loss function $L$ is Lipschitz-continuous with parameter $\lambda_0$, i.e.,

$$|L(\theta_1, \xi) - L(\theta_2, \xi)| \le \lambda_0 \|\theta_1 - \theta_2\|,$$

(b) The map $\nabla L$ is Lipschitz-continuous with parameter $\lambda_1$, i.e.,

$$\|\nabla L(\theta_1, \xi) - \nabla L(\theta_2, \xi)\| \le \lambda_1 \|\theta_1 - \theta_2\|,$$

(c) The map $\nabla^2 L$ is Lipschitz-continuous with parameter $\lambda_2$, i.e.,

$$\|\nabla^2 L(\theta_1, \xi) - \nabla^2 L(\theta_2, \xi)\| \le \lambda_2 \|\theta_1 - \theta_2\|.$$

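Assumption 9. The observed Fisher information matrix, $\hat{\mathcal{I}}(\theta) \triangleq \nabla^2 L(\theta, \xi)$, has non-vanishing trace, i.e., there exists $\phi > 0$ such that $\mathrm{tr}(\hat{\mathcal{I}}(\theta)) \ge \phi$, almost-surely, for all $\theta \in \Theta$. The expected Fisher information matrix, $\mathcal{I}(\theta) \triangleq \mathbb{E}(\hat{\mathcal{I}}(\theta))$, has minimum eigenvalue $0 < \lambda_f \le \phi$, for all $\theta \in \Theta$.

Assumption 10. The zero-mean random variable $W_\theta \triangleq \nabla L(\theta, \xi) - \nabla \ell(\theta)$ is square-integrable, such that, for a fixed positive-definite $\Sigma$,

$$\mathbb{E}\left(W_{\theta^*} W_{\theta^*}^\top\right) \preceq \Sigma.$$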


D Proof of Lemma 2 


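Definition 2. Suppose that Assumption 6 holds. For observation $\xi = (x, y)$, the first derivative with respect to the natural parameter $x^\top \theta$ is denoted by $L'(\theta, \xi)$, and is defined as

$$L'(\theta, \xi) \triangleq \frac{\partial L(\theta, \xi)}{\partial (x^\top \theta)} \overset{\text{def}}{=} \frac{\partial L(x^\top \theta, y)}{\partial (x^\top \theta)}. \tag{17}$$

Similarly, $L''(\theta, \xi) \triangleq \partial L'(\theta, \xi) / \partial (x^\top \theta)$.


Lemma 2. Suppose that Assumption 6 holds, and consider functions $L', L''$ from Definition 2. Then, almost-surely,

$$\nabla L(\theta_n, \xi_n) = s_n \nabla L(\theta_{n-1}, \xi_n); \tag{18}$$

the scalar $s_n$ satisfies the fixed-point equation,

$$s_n \kappa_{n-1} = L'(\theta_{n-1} - s_n \gamma_n \kappa_{n-1} x_n, \xi_n), \tag{19}$$

where $\kappa_{n-1} \triangleq L'(\theta_{n-1}, \xi_n)$. Moreover, if $L''(\theta, \xi) \ge 0$ almost-surely for all $\theta \in \Theta$, then
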

$$s_n \in \begin{cases} [\kappa_{n-1}, 0) & \text{if } \kappa_{n-1} < 0, \\ [0, \kappa_{n-1}] & \text{otherwise.} \end{cases}$$


Proof. See Toulis and Airoldi (2014, Theorem 4.1). □ 
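The fixed-point equation (19) reduces the $p$-dimensional implicit update to a one-dimensional root search. The sketch below illustrates this for a logistic loss, which is an assumed example and not prescribed by the lemma; the function names and the use of plain bisection are likewise choices made here. It solves for the scalar $u = s_n \kappa_{n-1}$, which lies between $0$ and $\kappa_{n-1}$ because $L'$ is nondecreasing when $L'' \ge 0$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def implicit_logistic_step(theta_prev, x, y, gamma_n, tol=1e-12, max_iter=200):
    """One implicit SGD step for the loss L(u, y) = log(1 + exp(u)) - y * u, u = x'theta."""
    a = x @ theta_prev        # natural parameter at the previous iterate
    s = x @ x                 # squared norm of the feature vector
    kappa = sigmoid(a) - y    # kappa_{n-1} = L'(theta_{n-1}, xi_n)
    # Solve u = L'(a - gamma_n * u * s) for u = s_n * kappa_{n-1}.
    # g(u) = u - L'(a - gamma_n * u * s) is increasing and changes sign
    # between 0 and kappa, so bisection on that interval finds the root.
    lo, hi = min(0.0, kappa), max(0.0, kappa)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mid - (sigmoid(a - gamma_n * mid * s) - y) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    u = 0.5 * (lo + hi)
    # Since grad L(theta_n) = u * x, the implicit update is theta_prev - gamma_n * u * x.
    return theta_prev - gamma_n * u * x, u, kappa
```

The returned iterate satisfies $\theta_n = \theta_{n-1} - \gamma_n \nabla L(\theta_n, \xi_n)$ up to the bisection tolerance, and the ratio `u / kappa` plays the role of $s_n$ in (18).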

E Proof of Theorem 3 

E.1 Useful lemmas

In this section, we will present the intermediate lemmas on recursions that will be useful for the non-asymptotic analysis of 
the implicit procedures. 

Lemma 3. Consider a sequence $b_n$ such that $b_n \downarrow 0$ and $\sum_{i=1}^{\infty} b_i = \infty$. Then, there exists a positive constant $K > 0$, such that

$$\prod_{i=1}^{n} \frac{1}{1 + b_i} \le \exp\left(-K \sum_{i=1}^{n} b_i\right). \tag{20}$$

Proof. See Toulis and Airoldi (2014, Lemma B.1). □
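Inequality (20) can be checked numerically. The sketch below assumes the particular constant $K = \log(1 + b_1)/b_1$, which is one valid choice (by concavity of $\log(1 + x)$ on $[0, b_1]$) and not necessarily the constant used in the referenced proof; the sequence $b_n = 0.8\, n^{-0.6}$ is also just an example.

```python
import numpy as np

def lemma3_gap(b):
    """Log of the right side of (20) minus log of the left side, for every prefix length.

    Uses K = log(1 + b_1) / b_1, an assumed (but valid) choice of the constant.
    """
    K = np.log1p(b[0]) / b[0]
    log_lhs = -np.cumsum(np.log1p(b))   # log of prod_i 1 / (1 + b_i)
    log_rhs = -K * np.cumsum(b)         # log of exp(-K * sum_i b_i)
    return log_rhs - log_lhs

n = np.arange(1, 10001)
b = 0.8 * n ** (-0.6)                   # decreasing to zero, with divergent sum
gap = lemma3_gap(b)
print(gap.min() >= -1e-12)              # the inequality holds for every prefix
```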

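Lemma 4. Consider scalar sequences $a_n \downarrow 0$, $b_n \downarrow 0$, and $c_n \downarrow 0$ such that $a_n = o(b_n)$, and $A \triangleq \sum_{i=1}^{\infty} a_i < \infty$. Suppose there exists $n'$ such that $c_n / b_n < 1$ for all $n > n'$. Define,

$$\delta_n \triangleq \frac{1}{a_n}\left(\frac{a_{n-1}}{b_{n-1}} - \frac{a_n}{b_n}\right) \quad \text{and} \quad \zeta_n \triangleq \frac{c_n}{b_{n-1}} \frac{a_{n-1}}{a_n}, \tag{21}$$

and suppose that $\delta_n \downarrow 0$ and $\zeta_n \downarrow 0$. Fix $n_0 > 0$ such that $\delta_n + \zeta_n < 1$ and $(1 + c_n)/(1 + b_n) < 1$, for all $n \ge n_0$. Consider a positive sequence $y_n > 0$ that satisfies the recursive inequality,

$$y_n \le \frac{1 + c_n}{1 + b_n} y_{n-1} + a_n. \tag{22}$$

Then, for every $n > 0$,

$$y_n \le K_0 \frac{a_n}{b_n} + Q_1^n y_0 + Q_{n_0 + 1}^n (1 + c_1)^{n_0} A, \tag{23}$$

where $K_0 = (1 + b_1)(1 - \delta_{n_0} - \zeta_{n_0})^{-1}$ and $Q_i^n = \prod_{j=i}^{n} (1 + c_j)/(1 + b_j)$, such that $Q_i^n = 1$ if $n < i$, by definition.


Proof. See Toulis and Airoldi (2014, Lemma B.2). □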

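Corollary 1. In Lemma 4 assume $a_n = a_1 n^{-\alpha}$ and $b_n = b_1 n^{-\beta}$, and $c_n = 0$, where $a_1, b_1, \beta > 0$ and $\max\{\beta, 1\} < \alpha < 1 + \beta$, and $\beta \le 1$. Then,

$$y_n \le \frac{2 a_1 (1 + b_1)}{b_1} n^{-\alpha + \beta} + \exp\left(-\log(1 + b_1)\, n^{1 - \beta}\right)\left[y_0 + (1 + b_1)^{n_0} A\right], \tag{24}$$

where $n_0 > 0$ and $A = \sum_{i=1}^{\infty} a_i < \infty$. If $\beta = 1$ then the above inequality holds by replacing the term $n^{1-\beta}$ with $\log n$.


Proof. See Toulis and Airoldi (2014, Corollary B.1). □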
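The polynomial rate in the corollary can be observed by iterating the recursion directly. The sketch below takes the recursive inequality of Lemma 4 with equality and $c_n = 0$; the constants `a1`, `b1`, `alpha`, `beta`, the starting value, and the number of steps are arbitrary illustrative choices. If $y_n$ decays at the rate $n^{-\alpha + \beta}$, the ratio $y_n / ((a_1/b_1)\, n^{-\alpha+\beta})$ should remain bounded.

```python
def iterate_recursion(y0, a1, b1, alpha, beta, n_steps):
    """Iterate y_n = y_{n-1} / (1 + b_n) + a_n with a_n = a1 * n^-alpha, b_n = b1 * n^-beta."""
    y = y0
    for n in range(1, n_steps + 1):
        a_n = a1 * n ** (-alpha)
        b_n = b1 * n ** (-beta)
        y = y / (1.0 + b_n) + a_n
    return y

# Illustrative constants satisfying max{beta, 1} < alpha < 1 + beta.
a1, b1, alpha, beta = 1.0, 1.0, 1.5, 0.75
n_steps = 200_000
y_n = iterate_recursion(10.0, a1, b1, alpha, beta, n_steps)
ratio = y_n / ((a1 / b1) * n_steps ** (-(alpha - beta)))
print(f"y_n = {y_n:.3e}, ratio to (a1/b1) * n^(beta - alpha) = {ratio:.3f}")
```

The contribution of the starting value `y0` is damped exponentially, so the ratio is governed by the polynomial term alone.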

Lemma 5. Suppose Assumptions 6, 8(a), and 9 hold. Then, almost surely,

$$s_n \ge \frac{1}{1 + \gamma_n \phi}, \tag{25}$$

$$\|\theta_n - \theta_{n-1}\|^2 \le 4 \lambda_0^2 \gamma_n^2, \tag{26}$$

where $s_n$ is defined in Lemma 2, and $\theta_n$ is the $n$th iterate of implicit SGD (14).

Proof. See Toulis and Airoldi (2014, Lemma B.3). □

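Theorem 3. Suppose that Assumptions 6, 7, 8(a), and 9 hold. Define $\delta_n \triangleq \mathbb{E}(\|\theta_n - \theta^*\|^2)$, and constants $\Gamma^2 \triangleq 4\lambda_0^2 \sum_{i=1}^{\infty} \gamma_i^2 < \infty$, $\epsilon \triangleq (1 + \gamma_1(\phi - \lambda_f))^{-1}$, and $\lambda \triangleq 1 + \gamma_1 \lambda_f \epsilon$. Then, there exists constant $n_0 > 0$ such that, for all $n > 0$,

$$\delta_n \le \frac{8 \lambda_0^2 \gamma_1 \lambda}{\lambda_f \epsilon} n^{-\gamma} + e^{-\log \lambda \cdot n^{1-\gamma}}\left[\delta_0 + \lambda^{n_0} \Gamma^2\right].$$


Proof. See Toulis and Airoldi (2014, Theorem 3.1). □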

Remarks. #1. Assuming Lipschitz continuity of the gradient $\nabla L$ instead of the function $L$, i.e., Assumption 8(b) over Assumption 8(a), would not alter the main result of Theorem 3 about the $O(n^{-\gamma})$ rate of the mean-squared error. Assuming Lipschitz continuity with constant $\lambda_1$ of $\nabla L$ and boundedness of $\mathbb{E}(\|\nabla L(\theta^*, \xi_n)\|^2) \le \sigma^2$, as is typical in the literature, would simply add a term $\gamma_n^2 \lambda_1^2 \mathbb{E}(\|\theta_{n-1} - \theta^*\|^2) + \gamma_n^2 \sigma^2$ in the corresponding recursive inequality. Specifically, by Lemma 2, $s_n \le 1$, and thus

$$\begin{aligned}
\mathbb{E}(\|\nabla L(\theta_n, \xi_n)\|^2) &= \mathbb{E}(s_n^2 \|\nabla L(\theta_{n-1}, \xi_n)\|^2) \le \mathbb{E}(\|\nabla L(\theta_{n-1}, \xi_n)\|^2) \\
&= \mathbb{E}(\|\nabla L(\theta_{n-1}, \xi_n) - \nabla L(\theta^*, \xi_n) + \nabla L(\theta^*, \xi_n)\|^2) \\
&\le \lambda_1^2 \mathbb{E}(\|\theta_{n-1} - \theta^*\|^2) + \mathbb{E}(\|\nabla L(\theta^*, \xi_n)\|^2) \\
&\le \lambda_1^2 \mathbb{E}(\|\theta_{n-1} - \theta^*\|^2) + \sigma^2.
\end{aligned} \tag{27}$$

The recursion for the implicit errors would then be

$$\mathbb{E}(\|\theta_n - \theta^*\|^2) \le \left(\frac{1}{1 + \gamma_n \lambda_f \epsilon} + \lambda_1^2 \gamma_n^2\right) \mathbb{E}(\|\theta_{n-1} - \theta^*\|^2) + \gamma_n^2 \sigma^2,$$

which also implies the $O(n^{-\gamma})$ convergence rate. However, it is an open problem whether it is possible to derive a nice stability property for implicit SGD under Assumption 8(b) similar to the result of Theorem 3 under Assumption 8(a).

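Remarks. #2. An assumption of almost-sure convexity can simplify the analysis significantly. For example, similar to the assumption of Ryu and Boyd (2014), assume that $L(\theta, \xi)$ is convex almost surely such that

$$(\theta_n - \theta^*)^\top \nabla L(\theta_n, \xi_n) \ge \frac{\mu_n}{2} \|\theta_n - \theta^*\|^2, \tag{28}$$

where $\mu_n \ge 0$ and $\mathbb{E}(\mu_n) = \mu > 0$. Then,

$$\begin{aligned}
\theta_n + \gamma_n \nabla L(\theta_n, \xi_n) &= \theta_{n-1} && \text{[by definition of implicit SGD (14)]} \\
\|\theta_n - \theta^*\|^2 + 2\gamma_n (\theta_n - \theta^*)^\top \nabla L(\theta_n, \xi_n) &\le \|\theta_{n-1} - \theta^*\|^2, \\
(1 + \gamma_n \mu_n) \|\theta_n - \theta^*\|^2 &\le \|\theta_{n-1} - \theta^*\|^2, \\
\mathbb{E}(\|\theta_n - \theta^*\|^2) &\le \frac{1}{1 + \gamma_n \mu} \mathbb{E}(\|\theta_{n-1} - \theta^*\|^2) + \mathrm{SD}(1 + \gamma_n \mu_n)\, \mathrm{SD}(\|\theta_n - \theta^*\|^2),
\end{aligned} \tag{29}$$

where the last inequality follows from the inequality $\mathbb{E}(XY) \ge \mathbb{E}(X)\mathbb{E}(Y) - \mathrm{SD}(X)\mathrm{SD}(Y)$. However, $\mathrm{SD}(1 + \gamma_n \mu_n) = O(\gamma_n)$, and assuming bounded $\mathrm{SD}(\|\theta_n - \theta^*\|^2)$ we get

$$\mathbb{E}(\|\theta_n - \theta^*\|^2) \le \frac{1}{1 + \gamma_n \mu} \mathbb{E}(\|\theta_{n-1} - \theta^*\|^2) + O(\gamma_n), \tag{30}$$

which indicates a fast convergence towards $\theta^*$. It is also possible to work with the recursion

$$\|\theta_n - \theta^*\|^2 \le \frac{1}{1 + \gamma_n \mu_n} \|\theta_{n-1} - \theta^*\|^2, \tag{31}$$

and then use a stochastic version of Lemma 4, although the analysis would be more complex in this case.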


F Proof of Theorem 5 

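In this section, we prove Theorem 5. To do so, we need bounds for $\mathbb{E}(\|\theta_n - \theta^*\|^2)$, which are available through Theorem 3, but also bounds for $\mathbb{E}(\|\theta_n - \theta^*\|^4)$, which are established in the following theorem.

Theorem 4. Suppose that Assumptions 6, 7, 8(a), and 9 hold. For a constant $K_3 > 0$, define $\zeta_n \triangleq \mathbb{E}(\|\theta_n - \theta^*\|^4)$, and constants $\Delta^3 \triangleq K_3 \sum_{i=1}^{\infty} \gamma_i^3 < \infty$, $\epsilon \triangleq (1 + \gamma_1(\phi - \lambda_f))^{-1}$, and $\lambda \triangleq 1 + \gamma_1 \lambda_f \epsilon$. Then, there exists constant $n_0$ such that, for all $n > 0$,

$$\zeta_n \le \frac{2 K_3 \gamma_1^2 \lambda}{\lambda_f \epsilon} n^{-2\gamma} + e^{-\log \lambda \cdot n^{1-\gamma}}\left[\zeta_0 + \lambda^{n_0} \Delta^3\right].$$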
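Proof. Define $W_n \triangleq s_n (\theta_{n-1} - \theta^*)^\top \nabla L(\theta_{n-1}, \xi_n)$ for compactness, and proceed as follows,

$$\begin{aligned}
\|\theta_n - \theta^*\|^2 &= \|\theta_{n-1} - \theta^*\|^2 - 2\gamma_n s_n (\theta_{n-1} - \theta^*)^\top \nabla L(\theta_{n-1}, \xi_n) + \gamma_n^2 \|\nabla L(\theta_n, \xi_n)\|^2 \\
\|\theta_n - \theta^*\|^2 &= \|\theta_{n-1} - \theta^*\|^2 - 2\gamma_n W_n + \gamma_n^2 \|\nabla L(\theta_n, \xi_n)\|^2 && \text{[by definition]} \\
\|\theta_n - \theta^*\|^2 &\le \|\theta_{n-1} - \theta^*\|^2 - 2\gamma_n W_n + 4\lambda_0^2 \gamma_n^2, \\
\|\theta_n - \theta^*\|^4 &\le \|\theta_{n-1} - \theta^*\|^4 + 4\gamma_n^2 W_n^2 + 16\lambda_0^4 \gamma_n^4 \\
&\quad - 2\gamma_n \|\theta_{n-1} - \theta^*\|^2 W_n + 4\lambda_0^2 \gamma_n^2 \|\theta_{n-1} - \theta^*\|^2 - 8\lambda_0^2 \gamma_n^3 W_n.
\end{aligned} \tag{32}$$

By Lemma 4 we have

$$\mathbb{E}(W_n \mid \mathcal{F}_{n-1}) \ge \frac{\lambda_f}{2(1 + \gamma_n \phi)} \|\theta_{n-1} - \theta^*\|^2. \tag{33}$$

Furthermore,

$$\begin{aligned}
\mathbb{E}(W_n^2 \mid \mathcal{F}_{n-1}) &\overset{\text{def}}{=} \mathbb{E}\left([s_n (\theta_{n-1} - \theta^*)^\top \nabla L(\theta_{n-1}, \xi_n)]^2 \mid \mathcal{F}_{n-1}\right) \\
&= \mathbb{E}\left([(\theta_{n-1} - \theta^*)^\top \nabla L(\theta_n, \xi_n)]^2 \mid \mathcal{F}_{n-1}\right) && \text{[by Lemma 1]} \\
&\le \|\theta_{n-1} - \theta^*\|^2\, \mathbb{E}\left(\|\nabla L(\theta_n, \xi_n)\|^2 \mid \mathcal{F}_{n-1}\right) && \text{[by Cauchy-Schwarz inequality]} \\
&\le 4\lambda_0^2 \|\theta_{n-1} - \theta^*\|^2. && \text{[by Lemma 5]}
\end{aligned} \tag{34}$$

Define $B_n \triangleq \mathbb{E}(\|\theta_n - \theta^*\|^2)$ for notational brevity. We use results (33) and (34) to get

$$\begin{aligned}
\mathbb{E}(\|\theta_n - \theta^*\|^4) &\le \left(1 - \frac{\gamma_n \lambda_f}{1 + \gamma_n \phi}\right) \mathbb{E}(\|\theta_{n-1} - \theta^*\|^4) + 4\lambda_0^2 \gamma_n^2 (4 B_{n-1} + B_{n-1}) + 16\lambda_0^4 \gamma_n^4 \\
\mathbb{E}(\|\theta_n - \theta^*\|^4) &\le \left(1 - \frac{\gamma_n \lambda_f}{1 + \gamma_n \phi}\right) \mathbb{E}(\|\theta_{n-1} - \theta^*\|^4) + 20\lambda_0^2 \gamma_n^2 B_{n-1} + 16\lambda_0^4 \gamma_n^4 \\
\mathbb{E}(\|\theta_n - \theta^*\|^4) &\le \frac{1}{1 + \gamma_n \lambda_f \epsilon} \mathbb{E}(\|\theta_{n-1} - \theta^*\|^4) + 20\lambda_0^2 \gamma_n^2 B_{n-1} + 16\lambda_0^4 \gamma_n^4 && \text{[by Assumption 9]} \\
\mathbb{E}(\|\theta_n - \theta^*\|^4) &\le \frac{1}{1 + \gamma_n \lambda_f \epsilon} \mathbb{E}(\|\theta_{n-1} - \theta^*\|^4) + K_0 \gamma_n^3 + e^{-\log \lambda \cdot n^{1-\gamma}} K_1 \gamma_n^2 + K_2 \gamma_n^4, && \text{[by Theorem 3]}
\end{aligned} \tag{35}$$

where $\epsilon = (1 + \gamma_1(\phi - \lambda_f))^{-1}$, $\lambda = 1 + \gamma_1 \lambda_f \epsilon$, and $\Gamma^2 = 4\lambda_0^2 \sum_{i=1}^{\infty} \gamma_i^2$ (as in Theorem 3), $K_0 = 160 \lambda_0^4 \lambda / (\lambda_f \epsilon)$, $K_1 = 20 \lambda_0^2 \left(\mathbb{E}(\|\theta_0 - \theta^*\|^2) + \lambda^{n_0} \Gamma^2\right)$, and $K_2 = 16 \lambda_0^4$, and $n_0$ is a constant defined in the proof of Theorem 3.

Now, define

$$K_3 = K_0 + K_2 \gamma_1 + \max_{n > 0}\left\{\frac{e^{-\log \lambda \cdot n^{1-\gamma}} K_1}{\gamma_n}\right\}, \tag{36}$$

which exists and is finite. Through simple algebra it is easy to verify that

$$K_0 \gamma_n^3 + e^{-\log \lambda \cdot n^{1-\gamma}} K_1 \gamma_n^2 + K_2 \gamma_n^4 \le K_3 \gamma_n^3, \tag{37}$$

for all $n$. Therefore, we can simplify Ineq. (35) as

$$\mathbb{E}(\|\theta_n - \theta^*\|^4) \le \frac{1}{1 + \gamma_n \lambda_f \epsilon} \mathbb{E}(\|\theta_{n-1} - \theta^*\|^4) + K_3 \gamma_n^3. \tag{38}$$

We can now apply Corollary 1 with $a_n \triangleq K_3 \gamma_n^3$ and $b_n \triangleq \gamma_n \lambda_f \epsilon$ to derive the final bounds for $\mathbb{E}(\|\theta_n - \theta^*\|^4)$. □

We now evaluate the mean squared error of the averaged iterates, $\bar{\theta}_n$.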
Theorem 5. Consider the AI-SGD procedure (15) and suppose that Assumptions 6, 7, 8(a), 8(c), 9, and 10 hold. Then,


1 




(E {\% - < (trace(V2£(0*)-^SV2£(0*)-i)) 

y/n 

71 ~ 

+ ^ 7 /^^ [^0 + 

^7n 


A 2 




V 

A 2 


2nXf 


1/2 


(2K37?A/^e)^/"n-^ 

[Co + A"‘>-^A3]i/2i^2W. 


(39) 


where $K_2(n) \triangleq \sum_{i=1}^{n} \exp\left(-\log \lambda \cdot i^{1-\gamma}/2\right)$, constants $\lambda$, $\epsilon$, $n_{0,1}$, $\delta_0$ are defined in Theorem 3 (substituting $n_{0,1}$ for $n_0$), and $\zeta_0$, $n_{0,2}$ are defined in Theorem 4 (substituting $n_{0,2}$ for $n_0$).


Proof. We leverage a result shown for averaged explicit stochastic gradient descent. In particular, it has been shown that the squared error for the averaged iterate satisfies:

(E (||0„ - (trace(V2£(0*)-^SV2f(0*)-^))'^' 

+ (E(||g„-^.|p))V^ 

^ ' n-in 
\ ^ 

+ (40) 

i=i 

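The proof technique for (40) was first devised by Polyak and Juditsky (1992), but was later refined by Xu (2011), and Moulines and Bach (2011). In this paper, we follow the formulation of Moulines and Bach (2011, Theorem 3, page 20); the derivation of Ineq. (40) for the implicit procedure is identical to the derivation for the explicit one, however the two procedures differ in the terms that appear in the bound (40).

All such terms in (40) have been bounded in the previous sections. In particular, we can use Theorem 3 for $\mathbb{E}(\|\theta_n - \theta^*\|^2)$; we can also use Theorem 4 and the concavity of the square-root to derive

$$\begin{aligned}
\sum_{i=1}^{n} \left(\mathbb{E}(\|\theta_i - \theta^*\|^4)\right)^{1/2} &\le \sum_{i=1}^{n} \left(\left(\frac{2 K_3 \gamma_1^2 \lambda}{\lambda_f \epsilon}\right)^{1/2} i^{-\gamma} + e^{-\log \lambda \cdot i^{1-\gamma}/2}\left[\zeta_0 + \lambda^{n_{0,2}} \Delta^3\right]^{1/2}\right) \\
&\le \left(\frac{2 K_3 \gamma_1^2 \lambda}{\lambda_f \epsilon}\right)^{1/2} n^{1-\gamma} + K_2(n)\left[\zeta_0 + \lambda^{n_{0,2}} \Delta^3\right]^{1/2},
\end{aligned} \tag{41}$$

where $K_2(n) \triangleq \sum_{i=1}^{n} e^{-\log \lambda \cdot i^{1-\gamma}/2}$, $\zeta_0 \triangleq \mathbb{E}(\|\theta_0 - \theta^*\|^4)$, and $\Delta^3$, $n_{0,2}$ are defined in Theorem 4, substituting $n_{0,2}$ for $n_0$. Similarly, using Theorem 3,

$$\left(\mathbb{E}(\|\theta_n - \theta^*\|^2)\right)^{1/2} \le \left(\frac{8 \lambda_0^2 \gamma_1 \lambda}{\lambda_f \epsilon}\right)^{1/2} n^{-\gamma/2} + e^{-\log \lambda \cdot n^{1-\gamma}/2}\left[\delta_0 + \lambda^{n_{0,1}} \Gamma^2\right]^{1/2},$$

where $\delta_0 \triangleq \mathbb{E}(\|\theta_0 - \theta^*\|^2)$, and $n_{0,1}$, $\Gamma^2$ are defined in Theorem 3, substituting $n_{0,1}$ for $n_0$. These two bounds can be used in Ineq. (40) and thus yield the result of Theorem 5. □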
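To make the trace term that leads the bound concrete, the sketch below evaluates $\mathrm{trace}(\nabla^2 \ell(\theta^*)^{-1} \Sigma \nabla^2 \ell(\theta^*)^{-1})$ for an assumed least-squares model with noise variance $\sigma^2$ independent of the features. In that setting (an assumption of this example, not a statement from the paper) the Hessian is $H = \mathbb{E}(x x^\top)$ and the gradient covariance at $\theta^*$ is $\Sigma = \sigma^2 H$, so the trace reduces to $\sigma^2\, \mathrm{trace}(H^{-1})$. The matrix `H` below is a hypothetical positive-definite example.

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma2 = 4, 0.25
A = rng.normal(size=(p, p))
H = A @ A.T + p * np.eye(p)      # hypothetical Hessian E[x x'], positive definite
Sigma = sigma2 * H               # gradient covariance at theta* for least squares
H_inv = np.linalg.inv(H)

# Trace term trace(H^{-1} Sigma H^{-1}) that appears in the leading part of the bound.
leading = np.trace(H_inv @ Sigma @ H_inv)

def leading_rmse(n):
    """Leading n^{-1/2} term of the root-mean-squared error of the averaged iterate."""
    return np.sqrt(leading / n)

print(leading, sigma2 * np.trace(H_inv), leading_rmse(400))
```

The two printed trace values agree, which is the familiar asymptotic variance of the least-squares estimator in this toy setting.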


G Data sets used in experiments 

Table 2 includes a full summary of all data sets considered in our experiments. The majority of regularization parameters are set according to Xu (2011).

data set  description           type    features  training set  test set  λ
covtype   forest cover type     sparse  54        464,809       116,203   10
delta     synthetic data        dense   500       450,000       50,000    10
rcv1      text data             sparse  47,152    781,265       23,149    10
mnist     digit image features  dense   784       60,000        10,000    10
sido      molecular activity    dense   4,932     10,142        2,536     10
alpha     synthetic data        dense   500       400k          50k       10
beta      synthetic data        dense   500       400k          50k       10
gamma     synthetic data        dense   500       400k          50k       10
epsilon   synthetic data        dense   2000      400k          50k       10
zeta      synthetic data        dense   2000      400k          50k       10
fd        character image       dense   900       1000k         470k      10
ocr       character image       dense   1156      1000k         500k      10
dna       DNA sequence          sparse  800       1000k         1000k     10

Table 2: Summary of data sets and the $L_2$ regularization parameter $\lambda$ used




